As part of the HLG-MOS challenge, we decided to use the toy data set satgpa. This data set contains 1000 observations and 6 variables.
The variable sat_sum is not considered in the analysis because it is fully determined by sat_sum = sat_v + sat_m. We used the R package synthpop to generate the synthetic data, then compared the synthetic data to the original data to measure how much information is lost in the process. Finally, we try to measure the privacy protection offered by the synthetic data set.
The synthetic data was produced with the syn() function from the R package synthpop, and the summaries below were produced with its summary() function.
# Synthesize all variables except sat_sum (column 4)
synth_data <- syn(data[,c(1,2,3,5,6)], minnumlevels=2)
## sex sat_v sat_m hs_gpa fy_gpa
## Min. :1.000 Min. :24.00 Min. :29.0 Min. :1.800 Min. :0.000
## 1st Qu.:1.000 1st Qu.:43.00 1st Qu.:49.0 1st Qu.:2.800 1st Qu.:1.980
## Median :1.000 Median :49.00 Median :55.0 Median :3.200 Median :2.465
## Mean :1.484 Mean :48.93 Mean :54.4 Mean :3.198 Mean :2.468
## 3rd Qu.:2.000 3rd Qu.:54.00 3rd Qu.:60.0 3rd Qu.:3.700 3rd Qu.:3.020
## Max. :2.000 Max. :76.00 Max. :77.0 Max. :4.500 Max. :4.000
## sex sat_v sat_m hs_gpa fy_gpa
## Min. :1.000 Min. :24.00 Min. :29.00 Min. :2.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:43.00 1st Qu.:49.00 1st Qu.:2.75 1st Qu.:1.995
## Median :1.000 Median :49.00 Median :55.00 Median :3.20 Median :2.480
## Mean :1.488 Mean :49.17 Mean :54.48 Mean :3.18 Mean :2.470
## 3rd Qu.:2.000 3rd Qu.:55.00 3rd Qu.:60.00 3rd Qu.:3.70 3rd Qu.:3.022
## Max. :2.000 Max. :76.00 Max. :77.00 Max. :4.50 Max. :4.000
In this section we compare the original data to the synthetic version.
import pandas as pd
import seaborn as sns
from scipy.stats import wasserstein_distance
import matplotlib.pyplot as plt
sns.set_theme(style="ticks")
# Load the CSV file into a dataframe
df = pd.read_csv('./data/satgpa.csv')
df = df.drop(columns = ['sat_sum'])
dfsynth = pd.read_csv('./synthpop/synth_data.csv')
sns.pairplot(df, hue = "sex")
# Necessary to render in RStudio
# plt.show()
sns.pairplot(dfsynth, hue = "sex")
# Necessary to render in RStudio
# plt.show()
These plots show the pairwise relationships between variables. This type of validation is most informative when the data are Gaussian (i.e., normally distributed); if the data were skewed, it would not be a sufficient test, since it only looks at two variables at a time. It is an appropriate test when the intended use is to perform calculations based on one-dimensional means and/or covariances. If, instead, the intended use of the synthetic data is to look at relationships among more than two variables (e.g., sex, sat_v, and fy_gpa), this test would not capture that information.
These heat maps can be used to preliminarily verify how similar the synthetic data is to the original data, assuming the data are normally distributed, by comparing the pairwise correlations in the respective data sets. The heat map makes it easy to check visually whether the two data sets produce similar results.
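As a sketch of this check, the two correlation matrices can also be compared element-wise: the smaller the differences, the better the synthetic data preserves pairwise correlations. The random stand-in frames below are illustrative only, since the real df and dfsynth are loaded from files above:

```python
import numpy as np
import pandas as pd

# Random stand-ins for the original and synthetic frames (illustration only)
rng = np.random.default_rng(0)
cols = ["sat_v", "sat_m", "hs_gpa", "fy_gpa"]
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=cols)
dfsynth = pd.DataFrame(rng.normal(size=(1000, 4)), columns=cols)

# Element-wise absolute difference between the two correlation matrices;
# values near 0 mean the pairwise correlations are well preserved
corr_diff = (df.corr() - dfsynth.corr()).abs()
max_diff = corr_diff.to_numpy().max()
# sns.heatmap(corr_diff, annot=True)  # could be rendered as a heat map
```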
To examine relationships between multiple variables at a time, we can look at the Earth Mover’s Distance (Wasserstein) which will be discussed in the next section.
Note: synthpop generates randomized values for the synthetic data, so multiple runs may not produce the same numbers. However, each run should yield similar results, so a single comparison is enough to demonstrate that the synthetic data shows correlations similar to the original data.
For this exercise, the team chose to measure the difference between the original and synthetic datasets using the Earth Mover’s Distance (EMD), also known as the Wasserstein metric. This metric was selected because it yields a single score (a distance) while preserving the detail of individual data points that can be lost when comparing aggregates. At its core, the EMD measures the distance between two distributions, ideally preserving their positioning in higher-dimensional spaces.
In Python, we used the wasserstein_distance function from the SciPy package, which yielded the following results when applied to the marginal distributions of the dataset:
| Variable | EMD |
|---|---|
| sex | 0.004 |
| sat_v | 0.247 |
| sat_m | 0.266 |
| hs_gpa | 0.022 |
| fy_gpa | 0.018 |
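A table like this can be produced with a dictionary comprehension over the columns. The frames below are random stand-ins, since the real df and dfsynth are loaded earlier in the document:

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

# Random stand-ins for the original (df) and synthetic (dfsynth) frames
rng = np.random.default_rng(1)
df = pd.DataFrame({"hs_gpa": rng.uniform(1.8, 4.5, 1000),
                   "fy_gpa": rng.uniform(0.0, 4.0, 1000)})
dfsynth = pd.DataFrame({"hs_gpa": rng.uniform(1.8, 4.5, 1000),
                        "fy_gpa": rng.uniform(0.0, 4.0, 1000)})

# EMD of each marginal: one number per column
emd = {col: wasserstein_distance(df[col], dfsynth[col]) for col in df.columns}
```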
We also measured the EMD between the marginal distributions conditional on the sex variable.
| Variable | EMD (sex = 1) | EMD (sex = 2) |
|---|---|---|
| sat_v | 0.236 | 0.513 |
| sat_m | 0.428 | 0.522 |
| hs_gpa | 0.028 | 0.028 |
| fy_gpa | 0.028 | 0.037 |
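A minimal sketch of the conditional computation: filter both frames on each value of sex before measuring the distance. The frames below are random stand-ins for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

# Random stand-ins including a sex indicator (illustration only)
rng = np.random.default_rng(2)
df = pd.DataFrame({"sex": rng.integers(1, 3, 1000),
                   "sat_v": rng.normal(49, 8, 1000)})
dfsynth = pd.DataFrame({"sex": rng.integers(1, 3, 1000),
                        "sat_v": rng.normal(49, 8, 1000)})

# EMD of the sat_v marginal, conditional on each value of sex
cond_emd = {s: wasserstein_distance(df.loc[df["sex"] == s, "sat_v"],
                                    dfsynth.loc[dfsynth["sex"] == s, "sat_v"])
            for s in (1, 2)}
```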
The descriptions below are adapted from the documentation of the R emdist package, which implements these functions. emd computes the Earth Mover’s Distance (EMD) between the distributions A and B. If A and B are not distributions, then A is treated as the source and B as the target.
emd2d interprets the two matrices A and B as a distribution over a two-dimensional grid. The distance between the grid points in each direction is defined by xdist and ydist. Both matrices must have the same dimensionality.
emd uses the first column of each matrix as the weights and the remaining columns as location coordinates in an up-to-four-dimensional space. A and B must have the same number of columns. emdw separates the weights from the location matrices but is otherwise identical to emd. emdr uses the original EMD implementation by Yossi Rubner of Stanford.
In case the two matrices A and B are not densities, the weighted sum of flows is normalized by the smaller total mass of the two. The version of the package released on CRAN contains only this implementation; all other functions are front-ends for the call to emdr.
The distance (dist) used to compute the cost over the locations must be either "euclidean", "manhattan", or a closure taking two vectors and returning a scalar. The latter case is much less efficient because it requires an R evaluation for every possible combination of flows.
The main difference between emd and emd2d is that emd operates on vectors of locations, whereas emd2d treats every grid point as a location; the latter therefore evaluates many more combinations than the former. This also explains why emd2d returns a value that is “more distant” than the emd result: because emd2d considers more combinations, it has more to iterate over.
Below are the results and run-time for each EMD function:
| Function | Result | Run-time |
|---|---|---|
| emd | 1.243486 | 5.467048 secs |
| emd2d | 2.925377 | 30.70497 mins |
The returned value shows how close the synthetic data is to the original data: ideally, it would be 0 (or close to 0). However, since we do not have a pre-determined threshold, we cannot say whether the returned values are “close enough” to the original data set.
That threshold may differ between users depending on their use case. Nonetheless, this test shows how similar the two data sets are to one another.
Which function should you use? It depends on your use-case.
If the goal is to create synthetic data that is as close as possible to the original, it may be useful to use emd2d for the most accurate picture (of the two functions), keeping in mind that it takes significantly longer to run than emd. Regardless of which option you choose for your final test, it would be prudent to run emd first and inspect its result before running emd2d: emd is time-effective, and if it returns a high value, emd2d will probably return an even higher one. If the value is above your desired threshold, it may be more worthwhile to adjust the parameters of your synthesis before testing it further.
Trying to match original records in synthetic data is a simple measure of privacy but a practical one. If a synthetic dataset were to be published, people would likely check whether they can find themselves in there.
If a record is fully replicated in the synthetic data, it is effectively disclosed.
Thankfully, finding matches is easy to do in R. An inner join of the original and synthetic data quickly returns all rows that match 100%.
# Number of 100% matched rows
library(dplyr)  # for the %>% pipe
full_matches = merge(satgpa, synth_bisaillon) %>% nrow()
full_matches
## [1] 37
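The same full-match count can be sketched in Python with pandas, whose merge() defaults to an inner join on all shared columns (tiny stand-in frames for illustration only):

```python
import pandas as pd

# Tiny stand-ins for the original and synthetic frames (illustration only)
satgpa = pd.DataFrame({"sex": [1, 2, 1], "hs_gpa": [3.2, 2.8, 3.5]})
synth = pd.DataFrame({"sex": [1, 1, 2], "hs_gpa": [3.2, 3.9, 2.8]})

# Inner join on all shared columns keeps only fully replicated rows
full_matches = len(satgpa.merge(synth))
```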
Even if a record is not fully replicated, pieces of it may still be recognizable. For example, someone’s neighbor might be able to make guesses with part of their data. Or if only 3 variables of a record are recognizable, it may still be perceived as disclosed.
In R, it is possible to run the merge() function iteratively on all combinations of columns. The combn() function returns all combinations of the elements of a vector.
# Return all possible samples of size 2 from (a, b, c)
combn(c("a", "b", "c"), 2, simplify = FALSE)
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "a" "c"
##
## [[3]]
## [1] "b" "c"
Partial matches are more likely than full matches. The satgpa dataset has a variable for sex, so it is nearly guaranteed that at least one column can be matched. As we try to match more and more columns, we get fewer matches.
Below are the results of some very quick grid searches, each run in only a few hours; longer searches would give more trustworthy results.
In ./gridsearch/gridsearch.R there are simple examples of an iterative grid search for synthpop synthesis.
A grid search is an exhaustive trial of “hyper-parameters”, i.e., settings chosen by the user. When the user does not know which settings are best, a grid search iterates over all of their combinations and returns the best scores.
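The idea can be sketched in a few lines of Python. Here the grid reuses the method and minnumlevels settings seen earlier, but the score function is a toy placeholder, not an actual synthpop call:

```python
from itertools import product

# Toy grid of hyper-parameter settings to try
grid = {"method": ["cart", "ctree"], "minnumlevels": [2, 5]}

def score(params):
    # Stand-in for: synthesize with these settings and return the earthmover score
    return len(params["method"]) + params["minnumlevels"]

# Enumerate every combination of settings and keep the one with the lowest score
combos = [dict(zip(grid, vals)) for vals in product(*grid.values())]
best = min(combos, key=score)
```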
In synthpop you can set the order in which variables are synthesized. It is not clear what the best order is, so we can try all of the combinations. Since synthesis is random, it is a good idea to run each test multiple times and average the results.
Below are the lowest earthmover scores for the sequences (averages of 3 trials each). sat_sum seems to be a good variable to start with.
## # A tibble: 6 × 9
## s1 s2 s3 s4 s5 s6 method mean_em mean_matches
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 sat_sum sat_v hs_gpa sat_m sex fy_gpa cart 1.27 13
## 2 sat_sum sat_v fy_gpa sat_m hs_gpa sex cart 1.27 18
## 3 sat_sum sat_v fy_gpa sex hs_gpa sat_m cart 1.28 20
## 4 sat_v sat_sum hs_gpa sex fy_gpa sat_m cart 1.30 17
## 5 sex sat_sum hs_gpa sat_v fy_gpa sat_m cart 1.30 10.3
## 6 hs_gpa sat_v sat_sum sex sat_m fy_gpa cart 1.31 11.7
synthpop offers many different methods of synthesis. We can perform a grid search on these as well.
The following worked with the dataset: ctree, cart, rf, ranger, norm, normrank, sample
These didn’t work, either because they are intended for categorical data or because the dataset would need extra processing: bagging, survctree, lognorm, sqrtnorm, cubertnorm, logreg, polyreg, polr, pmm, passive, nested, satcat
Using the sat_sum sat_v hs_gpa sat_m sex fy_gpa sequence (each method run 10 times), we see that the random forest methods give the best scores. However, these earthmover scores are higher than those from the sequence search above.
## # A tibble: 6 × 3
## method mean_em mean_matches
## <chr> <dbl> <dbl>
## 1 ranger 1.52 13.4
## 2 rf 1.52 1.3
## 3 cart 1.58 11.6
## 4 ctree 1.68 0.8
## 5 norm 1.81 0
## 6 normrank 1.83 0.3